Objective. The goal of this project is to implement the k-means clustering algorithm, a technique often used in machine learning, in Python and use it for data analysis. We write various functions using lists, sets, dictionaries, sorting, and graph data structures for computational problem-solving and analysis.


Part 1. Spotify API Data

Spotify is a popular audio streaming platform with an extensive music database. The Spotify API allows developers to access the platform’s data, providing insights into music listening habits around the world[1]. Using the API requires an initial setup: registering as a Spotify developer, creating an app, modifying the dashboard redirect URI, and storing the client ID and secret. After completing these steps, we can query the Spotify API.

Get Playlist Data from API

First, we create a Client Credentials Flow manager, used for server-to-server authentication, by passing the necessary parameters to spotipy’s SpotifyClientCredentials class[2]. We provide a client ID and client secret to the constructor; this authorization flow does not require user interaction.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Set client id and client secret
client_id = 'xxx'
client_secret = 'xxx'

# Spotify authentication
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Now we can get the full details of the tracks of a playlist based on a playlist ID, URI, or URL. Choose a specific playlist to analyze by copying its URL from the Spotify Player interface. Using that link, the following code calls the playlist_tracks method and collects the track ID and the first-listed artist for each track in the playlist.

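track_ids, artist_ids = [], []

for link in playlist_links:  # playlist_links: list of playlist URLs copied from the player
    # Keep the last path segment and drop the query string to get the playlist ID
    playlist_URI = link.split("/")[-1].split("?")[0]
    # Iterate over list of tracks in playlist
    for i in sp.playlist_tracks(playlist_URI)["items"]:
        track_ids.append(i['track']["id"])  # Extract song id
        artist_ids.append(i['track']["artists"][0]["uri"])  # Extract URI of the first-listed artist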
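
The link handling above can be checked in isolation, without calling the API. The following is a minimal sketch, assuming links come either as a web URL or as a spotify:playlist: URI; the helper name playlist_id_from_link and the ID in the example are placeholders, not part of the original project.

```python
from urllib.parse import urlparse


def playlist_id_from_link(link: str) -> str:
    """Return the playlist ID from a Spotify playlist URL or URI (hypothetical helper)."""
    if link.startswith("spotify:"):
        # URI form: spotify:playlist:<id>
        return link.split(":")[-1]
    # URL form: https://open.spotify.com/playlist/<id>?si=...
    path = urlparse(link).path
    return path.rstrip("/").split("/")[-1]


# Placeholder ID for illustration only; it does not refer to a real playlist.
example = "https://open.spotify.com/playlist/0123456789abcdefABCDEF?si=abc123"
print(playlist_id_from_link(example))
```

Parsing the URL with urlparse separates the path from the query string, so the function does not depend on where the "?" appears in the link.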

Then, we write a function that takes the playlist data from the API and gets the metadata and audio characteristics of each track. Specifically, the function reads the query results for a playlist and returns the track name, track ID, artist, album, duration, popularity, artist popularity, artist genre, and audio characteristics for each track.

  • name: The name of the track.
  • album: The name of the album on which the track appears.
  • artist: The name of the artist who performed the track.
  • release_date: The date the album was first released.
  • length: The track length in milliseconds.
  • popularity: The popularity of the track calculated by an algorithm based on the total number of plays the track has had and how recent those plays are.
  • artist_pop: The popularity of the artist calculated from the popularity of all the artist’s tracks.
  • artist_genres: A list of the genres the artist is associated with.

Spotify Audio Features

Spotify’s audio features are precalculated measures of both low-level and high-level perceptual music qualities that help classify a track. Each feature is summarized below, following Spotify’s own descriptions. More information on how to interpret these audio features is available in Spotify’s API documentation.

  • acousticness: A confidence measure of whether the track is acoustic.
  • danceability: Describes how suitable a track is for dancing based on tempo, rhythm, beat strength, and regularity.
  • energy: A perceptual measure of intensity and activity.
  • instrumentalness: Predicts whether a track contains no vocals.
  • liveness: Probability that the track was performed live.
  • loudness: Overall loudness of a track in decibels (dB).
  • speechiness: Detects the presence of spoken words in a track.
  • tempo: Estimated pace of a track in beats per minute (BPM).
  • valence: A measure describing the musical positiveness of a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric).
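
The function that gathers these fields is not shown here, so the following is only a sketch of how one record could be assembled. It assumes the three inputs are dictionaries shaped like the responses of spotipy’s sp.track, sp.artist, and one element of sp.audio_features; the helper name build_track_row is hypothetical, and the sample values are copied from the first row of this project’s sample output.

```python
AUDIO_FEATURES = [
    "acousticness", "danceability", "energy", "instrumentalness",
    "liveness", "loudness", "speechiness", "tempo", "valence",
]


def build_track_row(track: dict, artist: dict, features: dict) -> dict:
    """Flatten three API response dicts into one record (hypothetical helper)."""
    row = {
        "name": track["name"],
        "album": track["album"]["name"],
        "artist": track["artists"][0]["name"],  # first-listed artist only
        "release_date": track["album"]["release_date"],
        "length": track["duration_ms"],
        "popularity": track["popularity"],
        "artist_pop": artist["popularity"],
        "artist_genres": artist["genres"],
    }
    # Copy the nine audio features under their original names
    row.update({key: features[key] for key in AUDIO_FEATURES})
    return row


# Stand-in responses holding only the fields the helper reads
track = {
    "name": "2 AM",
    "album": {"name": "Pure Infinity", "release_date": "2019-05-24"},
    "artists": [{"name": "SwaVay"}],
    "duration_ms": 198577,
    "popularity": 54,
}
artist = {
    "popularity": 51,
    "genres": ["atl hip hop", "indie hip hop", "underground hip hop"],
}
features = {
    "acousticness": 0.434, "danceability": 0.783, "energy": 0.341,
    "instrumentalness": 9.85e-05, "liveness": 0.362, "loudness": -12.353,
    "speechiness": 0.0727, "tempo": 126.799, "valence": 0.184,
}

row = build_track_row(track, artist, features)
print(row["name"], row["artist"], row["tempo"])
```

Keeping the flattening step separate from the API calls makes it easy to test with stored responses.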

The following code loops through each track ID in the playlist and extracts the song information by calling the function we created. From there, we can create a dataframe by passing in the returned data using the pandas package.

# Loop over track ids
all_tracks = [playlist_features(track_ids[i], artist_ids[i], playlist_ids[i]) 
              for i in range(len(track_ids))]

| name | album | artist | release_date | length | popularity | artist_pop | artist_genres | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 AM | Pure Infinity | SwaVay | 2019-05-24 | 198577 | 54 | 51 | ['atl hip hop', 'indie hip hop', 'underground hip hop'] | 0.434 | 0.783 | 0.341 | 9.85e-05 | 0.362 | -12.353 | 0.0727 | 126.799 | 0.184 |
| Golden Child | Lady Wrangler | Shaboozey | 2018-10-05 | 177773 | 45 | 56 | ['pop rap'] | 0.362 | 0.792 | 0.591 | 1.90e-06 | 0.360 | -8.848 | 0.2900 | 151.029 | 0.365 |
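
As a small illustration of the dataframe step, the sketch below assumes each element of all_tracks is a dictionary keyed by column name (the original function may return a different structure). The two records reuse values from the sample output above, trimmed to four columns.

```python
import pandas as pd

# Two records from the sample output, trimmed to a few columns
all_tracks = [
    {"name": "2 AM", "artist": "SwaVay", "length": 198577, "tempo": 126.799},
    {"name": "Golden Child", "artist": "Shaboozey", "length": 177773, "tempo": 151.029},
]

# A list of dicts becomes one row per dict, one column per key
df = pd.DataFrame(all_tracks)
print(df.dtypes)
```

If the function returns tuples or lists instead, the column names can be supplied through the columns argument of pd.DataFrame.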

Part 2. Similar Artists

First, we want to find the most frequently occurring artist in a given playlist. We use the value_counts function to get a sequence containing counts of unique values sorted in descending order.

# Count distinct values in column (sorted by count, descending)
tallyArtists = df.value_counts(["artist", "artist_id"]).reset_index(name='counts')
# Row 0 holds the most frequent artist
topArtist = tallyArtists['artist_id'][0]

| artist | artist_id | counts |
| --- | --- | --- |
| Juice WRLD | 4MCBfE4596Uoi2O4DtmEMz | 10 |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | 8 |
| SAINt JHN | 0H39MdGGX6dbnnQPt6NQkZ | 3 |
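
The same tally can be sketched with the standard library, which makes the ordering explicit. The artist list below is a shortened, made-up stand-in for the playlist’s artist column, using names from the table above.

```python
from collections import Counter

# Shortened stand-in for the playlist's "artist" column
artists = [
    "Post Malone", "Juice WRLD", "Juice WRLD",
    "SAINt JHN", "Post Malone", "Juice WRLD",
]

tally = Counter(artists)
# most_common() sorts by count, descending, so the first entry is the top artist
top_artist, top_count = tally.most_common(1)[0]
print(top_artist, top_count)
```

As with value_counts, the most frequent value sits at position 0 of the sorted result.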

We can retrieve artist and related-artist data using the following code, passing the artist ID to the artist and artist_related_artists methods of the spotipy package. The returned list of similar artists is sorted by similarity score based on listener data[3].

a = sp.artist(topArtist)
ra = sp.artist_related_artists(topArtist)

Below is a sample of the result when we query Spotify for the artists most similar to a given artist. Each row is an edge of the connection graph, pairing a source artist ID with a target artist ID. We also retrieve data for the nodes of the graph, building a list that holds information for each artist.

| source_name | source_id | target_name | target_id |
| --- | --- | --- | --- |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | Rae Sremmurd | 7iZtZyCzp3LItcw1wtPI3D |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | Huncho Jack | 6extd4B6hl8VTmnlhpl2bY |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | Tyla Yaweh | 1MXZ0hsGic96dWRDKwAwdr |
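
One way to build such rows is sketched below. It assumes the artist response is a dictionary with name and id keys and that the related-artists response holds its results under an artists key; the helper names related_edges and adjacency are hypothetical. The sample values come from the table above.

```python
def related_edges(source: dict, related: dict) -> list:
    """Return (source_name, source_id, target_name, target_id) rows."""
    return [
        (source["name"], source["id"], target["name"], target["id"])
        for target in related["artists"]
    ]


def adjacency(edges: list) -> dict:
    """Group target IDs by source ID to form a simple adjacency list."""
    graph = {}
    for _, source_id, _, target_id in edges:
        graph.setdefault(source_id, set()).add(target_id)
    return graph


# Stand-in responses holding only the fields the helpers read
a = {"name": "Post Malone", "id": "246dkjvS1zLTtiykXe5h60"}
ra = {
    "artists": [
        {"name": "Rae Sremmurd", "id": "7iZtZyCzp3LItcw1wtPI3D"},
        {"name": "Huncho Jack", "id": "6extd4B6hl8VTmnlhpl2bY"},
        {"name": "Tyla Yaweh", "id": "1MXZ0hsGic96dWRDKwAwdr"},
    ]
}

edges = related_edges(a, ra)
graph = adjacency(edges)
print(len(edges), sorted(graph[a["id"]]))
```

Storing the graph as a dictionary of sets keeps neighbor lookups cheap and removes duplicate edges when the same pair is returned more than once.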

 

Let’s see how things look when we pull in the full dataset, with the top artist’s most similar artists and each of their most similar artists. The following visualization is based on the Spotify Similar Artists API article and was created with Flourish Studio.

Part 4. K-Means Clustering

Next, we apply K-Means clustering using the scikit-learn library to break a large playlist into several smaller playlists. The first step is to create a KMeans estimator with k = 3 clusters and fit it to the audio-feature columns.

from sklearn.cluster import KMeans

# columns: list of the audio-feature column names used for clustering
kmeans = KMeans(n_clusters=3)
kmeans.fit(df[columns])
# Store each track's cluster label
df['group'] = kmeans.labels_
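
To show what the estimator does internally, here is a minimal pure-Python sketch of Lloyd’s algorithm, the standard k-means iteration. It is not scikit-learn’s implementation, which defaults to k-means++ initialisation; the function name k_means, its init parameter, and the toy points are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, init=None, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k distinct data points
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Toy 2-D data with three well-separated groups
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
    (10.0, 0.0), (10.2, 0.1), (9.9, -0.1),
]
centroids, labels = k_means(points, k=3, init=[points[0], points[3], points[6]])
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With random starting points the result can depend on the seed, because k-means only finds a local optimum; that is one reason library implementations use more careful initialisation.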

To further explore the data and reduce the audio features to a smaller set of variables that is easier to visualize, we apply principal component analysis (PCA). We first scale the data as follows.

from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(df[columns])
scaled_data = scaler.transform(df[columns])
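
Standardization replaces each value x in a column with z = (x - mean) / std, where the standard deviation is the population one (dividing by n, not n - 1). The sketch below applies this to a single column in plain Python; the function name standardize is hypothetical, and the two tempo values come from the sample tracks shown earlier.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Scale values to zero mean and unit variance (population std)."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in column]


tempo = [126.799, 151.029]  # tempo of the two sample tracks
z = standardize(tempo)
print(z)  # approximately [-1.0, 1.0]
```

Scaling matters for PCA because features on larger numeric scales, such as tempo or loudness, would otherwise dominate the components.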

We then use a PCA instance that looks for two principal components determined from the data variables. From there, we visualize the two principal components and explore the variation.

from sklearn.decomposition import PCA

# Keep the two components that explain the most variance
pca = PCA(n_components=2)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
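
For intuition, the projection can also be sketched directly with NumPy: center the data, take a singular value decomposition, and project onto the first two right singular vectors. This is an illustrative sketch rather than scikit-learn’s code, the function name pca_2d is hypothetical, the input is random made-up data, and the signs of the components may differ from what PCA returns.

```python
import numpy as np


def pca_2d(X):
    """Project rows of X onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ Vt[:2].T
    # Share of total variance captured by each component
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:2]


# Made-up data: 20 tracks with 5 already-scaled features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

scores, explained = pca_2d(X)
print(scores.shape, explained)
```

The explained-variance shares indicate how much of the variation in the features a two-dimensional plot can actually show.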

|   | acousticness | danceability | energy | instrumentalness | liveness | speechiness | valence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.5396875 | 0.737625 | 0.5372187 | 0.02251566 | 0.1513125 | 0.1588813 | 0.4577406 |
| 2 | 0.1193033 | 0.6751 | 0.6478 | 0.0002468313 | 0.1572567 | 0.1539933 | 0.5820333 |
| 3 | 0.1411991 | 0.6402273 | 0.5815455 | 0.0002416193 | 0.1652682 | 0.1366114 | 0.2303227 |

References

[1] Web API Reference | Spotify for Developers, https://developer.spotify.com/documentation/web-api/reference/.
[2]
[3] E. Webb, Visualizing Rap Communities with Python & Spotify’s API, https://unboxed-analytics.com/data-technology/visualizing-rap-communities-wtih-python-spotifys-api/.
[4] L. Mauro, Spotify Songs - Similarity Search, https://www.kaggle.com/code/leomauro/spotify-songs-similarity-search/notebook.